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Abstract 

We examine some differential geometric approaches to finding ap- 
proximate solutions to the continuous time nonlinear filtering problem. 
Our primary focus is a projection method using the direct I? metric 
onto a family of normal mixtures. We compare this method to earlier 
projection methods based on the Hellinger distance/Fisher metric and 
exponential families, and we compare the L 2 mixture projection fil- 
ter with a particle method with the same number of parameters. We 
study particular systems that may illustrate the advantages of this 
filter over other algorithms when comparing outputs with the opti- 
mal filter. We finally consider a specific software design that is suited 
for a numerically efficient implementation of this filter and provide 
numerical examples. 

Keywords: Finite Dimensional Families of Probability Distributions, Ex- 
ponential Families, Mixture Families, Hellinger distance, Fisher information 
metric, Direct L2 metric, Stochastic filtering 
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1 Introduction 

In the nonlinear filtering problem one observes a system whose state is known 
to follow a given stochastic differential equation. The observations that have 
been made contain an additional noise term, so one cannot hope to know 
the true state of the system. However, one can reasonably ask what is the 
probability density over the possible states. 

When the observations are made in continuous time, the probability den- 
sity follows a stochastic partial differential equation known as the Kushner- 
Stratonovich equation. This can be seen as a generalization of the Fokker- 
Planck equation that expresses the evolution of the density of a diffusion 
process. Thus the problem we wish to address boils down to finding approx- 
imate solutions to the Kushner-Stratonovich equation. 

For a quick introduction to the filtering problem see Davis and Marcus 
(1981) [15] . For a more complete treatment from a mathematical point of 
view see Lipster and Shiryayev (1978) [29]. See Jazwinski (1970) [21] for a 
more applied perspective. For recent results see the collection of papers [Ti] . 

The main idea we will employ is inspired by the differential geometric 
approach to statistics developed in [2] and |32j . One thinks of the probability 
distribution as evolving in an infinite dimensional space V which is in turn 
contained in some Hilbert space H. One can then think of the Kushner- 
Stratonovich equation as defining a vector field in V: the integral curves of 
the vector field should correspond to the solutions of the equation. To find 
approximate solutions to the Kushner-Stratonovich equation one chooses a 
finite dimensional submanifold M of H and approximates the probability 
distributions as points in M. At each point of M one can use the Hilbert 
space structure to project the vector field onto the tangent space of M. One 
can now attempt to find approximate solutions to the Kushner-Stratonovich 
equations by integrating this vector field on the manifold M. 

This mental image is slightly innaccurate. The Kushner-Stratonovich 
equation is a stochastic PDE rather than a PDE so one should imagine some 
kind of stochastic vector field rather than a smooth vector field. Thus in this 
approach we hope to approximate the infinite dimensional stochastic PDE 
by solving a finite dimensional stochastic ODE on the manifold. 
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Note that our approximation will depend upon two choices: the choice 
of manifold M and the choice of Hilbert space structure H. In this paper 
we will consider two possible choices for the Hilbert space structure: the 
direct L 2 metric on the space of probability distributions; the Hilbert space 
structure associated with the Hellinger distance and the Fisher Information 
metric. Our focus will be on the direct L 2 metric since projection using the 
Hellinger distance has been considered before. As we shall see, the choice 
of the "best" Hilbert space structure is determined by the manifold one 
wishes to consider — for manifolds associated with exponential families of 
distributions the Hellinger metric leads to the simplest equations, whereas 
the direct L 2 metric works well with mixture distributions. 

We will write down the stochastic ODE determined by this approach 
when H = L 2 and show how it leads to a numerical scheme for finding 
approximate solutions to the Kushner-Stratonovich equations in terms of a 
mixture of normal distributions. We will call this scheme the L 2 normal 
mixture projection filter or simply the L2NM projection filter. 

The stochastic ODE for the Hellinger metric was considered in [12], [9] 
and [10J. In particular a precise numerical scheme is given in [12] for finding 
solutions by projecting onto an exponential family of distributions. We will 
call this scheme the Hellinger exponential projection filter or simply the HE 
projection filter. 

We will compare the results of a C++ implementation of the L2NM pro- 
jection filter with a number of other numerical approaches including the HE 
projection filter and the optimal filter. We can measure the goodness of our 
filtering approximations thanks to the geometric structure and, in particular, 
the precise metrics we are using on the spaces of probability measures. 

What emerges is that the two projection methods produce excellent re- 
sults for a variety of filtering problems. The results appear similar for both 
projection methods; which gives more accurate results depends upon the 
problem. 

As we shall see, however, the L2NM projection approach can be im- 
plemented more efficiently In particular one needs to perform numerical 
integration as part of the HE projection filter algorithm whereas all integrals 
that occur in the L2NM projection can be evaluated analytically. 

We also compare the L2NM filter to a particle filter with the best possible 
combination of particles with respect to the Levy metric. Introducing the 
Levy metric is needed because particles densities do not compare well with 
smooth densities when using L 2 induced metrics. We show that, given the 
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same number of parameters, the L2NM may outperform a particles based 
system. 

The paper is structured as follows: In Section [2] we introduce the nonlinear 
filtering problem and the infinite-dimensional Stochastic PDE (SPDE) that 
solves it. In Section[3]we introduce the geometric structure we need to project 
the filtering SPDE onto a finite dimensional manifold of probability densities. 
In Section [4] we perform the projection of the filtering SPDE according to 
the L2NM framework and also recall the HE based framework. In Section 
[5] we briefly discuss the numerical implementation, while in Section [6] we 
discuss in detail the software design for the L2NM filter. In Section [7] we 
look at numerical results, whereas in Section [8] we compare our outputs with 
a particle method. Section [9] concludes the paper. 

2 The non-linear filtering problem 
with continuous-time observations 

In the non-linear filtering problem the state of some system is modelled by 
a process X called the signal. This signal evolves over time t according to 
an Ito stochastic differential equation (SDE). We measure the state of the 
system using some observation Y. The observations are not accurate, there 
is a noise term. So the observation Y is related to the signal X by a second 
equation. 

dX t = f t (X t )dt + a t (X t )dW t , X , 

(1) 

dY t = b t (X t ) dt + dV t , Y = 0. 

In these equations the unobserved state process {X t ,t > 0} takes values 
in IR n , the observation {Y t ,t > 0} takes values in IR d and the noise processes 
{Wt,t > 0} and {Vt,t > 0} are two Brownian motions. 

The nonlinear filtering problem consists in finding the conditional proba- 
bility distribution it t of the state X t given the observations up to time t and 
the prior distribution 7r for X . 

Let us assume that X , and the two Brownian motions are independent. 
Let us also assume that the covariance matrix for V t is invertible. We can 
then assume without any further loss of generality that its covariance matrix 
is the identity. We introduce a variable a t defined by: 
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Of = <J t cr t 

With these preliminaries, and a number of rather more technical condi- 
tions which we will state shortly, one can show that n t satisfies the a stochas- 
tic PDE called the Kushner-Stratonovich equation. This states that for any 
compactly supported test function <f> defined on E n 

n {<j>) = 7r o (0)+ f -K s (C s <f>) ds+T /Ws0)-^sK(0)] [dY s k -7T s (b k s )d S ] , 

Jo k=1 Jo 

(2) 

where for all t > 0, the backward diffusion operator C t is defined by 

i=l *J = 1 

Equation ^ involves the derivatives of the test function because of the 
expression C s (p. We assume now that n t can be represented by a density p t 
with respect to the Lebesgue measure on M. n for all time t > and that we 
can replace the term involving £ s <j) with a term involving its formal adjoint 
C*. Thus, proceeding formally, we find that pt obeys the following Ito-type 
stochastic partial differential equation (SPDE): 

d 

dp t = £* tPt dt + Y,Pt[bt ~ E P MWY t k - E pt {b k }dt] 

k=l 

where E pt {-} denotes the expectation with respect to the probability density 
Pt (equivalently the conditional expectation given the observations up to time 
t). The forward diffusion operator C* is defined by: 

i=l i,j'=l J 

This equation is written in Ito form. When working with stochastic cal- 
culus on manifolds it is necessary to use Stratonovich SDE's rather than Ito 
SDE's. This is because one does not in general know how to interpret the 
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second order terms that arise in Ito calculus in terms of manifolds. The in- 
terested reader should consult [16J. A straightforward calculation yields the 
following Stratonvich SPDE: 

d 

dpt = Ctvtdt - \ V t [N 2 - *UN 2 }] dt + J> [b k t - E pt {b k t }] o dY t k . 

k=l 

We have indicated that this is the Stratonovich form of the equation by 
the presence of the symbol ! o' inbetween the diffusion coefficient and the 
Brownian motion of the SDE. We shall use this convention throughout the 
rest of the paper. 

In order to simplify notation, we introduce the following definitions : 
7t°(p) := \[\h? ~ E p {\b t \*}\ p, 

(3) 

7t fc (p) := [b k - E p {b k }]p , 

for k = 1, • • • , d. The Str form of the Kushner-Stratonovich equation reads 
now 

d 

dpt = C* t V t dt - 7t °(p t ) dt + Y, lt(Pt) o dY k . (4) 

k=l 

Thus, subject to the assumption that a density p t exists for all time and 
assuming the necessary decay condition to ensure that replacing £ with its 
formal adjoint is valid, we find that solving the non-linear filtering problem 
is equivalent to solving this SPDE. Numerically approximating the solution 
of equation Q is the primary focus of this paper. 

For completeness we review the technical conditions required in order for 
equation [2] to follow from Q. 

(A) Local Lipschitz continuity : for all R > 0, there exists Kr > such 
that 

\ft(x)-f t (x')\ < K R \x-x'\ and \\a t (x)-a t (x')\\ < K R \x-x'\ , 
for all t > 0, and for all x, x' G Br, the ball of radius R. 

(B) Non-explosion : there exists K > such that 

x T ft(x) < K (1 + \x\ 2 ) and trace a t (x) < K (1 + \x\ 2 ) , 
for all t > 0, and for all x E R n . 
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(C) Polynomial growth : there exist K > and r > such that 

\b t (x)\ <K(l + \x\ r ) , 

for all t > 0, and for all x £ R n . 

Under assumptions (A) and (B), there exists a unique solution {X t , t > 
0} to the state equation, see for example [24], and X t has finite moments of 
any order. Under the additional assumption (C) the following finite energy 
condition holds 

E [ \b t {X t )\ 2 dt < oo , forallT>0. 
Jo 

Since the finite energy condition holds, it follows from Fujisaki, Kallian- 
pur and Kunita [22] that {nt , t > 0} satisfies the Kushner-Stratonovich 
equation [2] 

3 Statistical manifolds 
3.1 Families of distributions 

As discussed in the introduction, the idea of a projection filter is to ap- 
proximate solutions to the Kushner-Stratononvich equation [2] using a finite 
dimensional family of distributions. 

Example 3.1 A normal mixture family contains distributions given by: 

with Aj > and Aj = 1. It is a 3m — 1 dimensional family of distributions. 

Example 3.2 A polynomial exponential family contains distributions given 
by: 

m 

p = exp(^^ OiX 1 ) 

i=0 

where ao is chosen to ensure that the integral ofp is equal to 1. To ensure the 
convergence of the integral we must have that m is even and a m < 0. This 
is an m dimensional family of distributions. Polynomial exponential families 
are a special case of the more general notion of an exponential family, see for 
example JE/. 



J. Armstrong, D. Brigo. Stochastic Filtering via L 2 Projection 



8 



A key motivation for considering these families is that one can reproduce 
many of the qualitative features of distributions that arise in practice using 
these distributions. For example, consider the qualitative specification: the 
distribution should be bimodal with peaks near —1 and 1 with the peak at 
— 1 twice as high and twice as wide as the peak near 1. One can easily write 
down a distribution of this approximates form using a normal mixture. 

To find a similar exponential family, one seeks a polynomial with: local 
maxima at —1 and 1; with the maximum values at these points differing 
by log(2); with second derivative at 1 equal to twice that at —1. These 
conditions give linear equations in the polynomial coefficients. Using degree 
6 polynomials it is simple to find solutions meeting all these requirements. 
A specific numerical example of a polynomial meeting these requirements is 
plotted in Figure [T] The associated exponential distribution is plotted in 
Figure [2} 




Figure 1: y = -18.98 - 13.15a; + 23.54a; 2 + 25.43a; 3 + 13.96a; 4 - 12.63a; 5 - 
17.15a; 6 

We see that normal mixtures and exponential families have a broadly 
similar power to describe the qualitative shape of a distribution using only a 
small number of parameters. Our hope is that by approximating the prob- 
ability distributions that occur in the Kushner-Stratonovich equation by 
elements of one of these families we will be able to derive a low dimensional 
approximation to the full infinite dimensional stochastic partial differential 
equation. 
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Figure 2: y = exp(-18.98 - 13. 15x + 23.54a; 2 + 25.43a; 3 + 13.96a; 4 - 12.63a; 5 - 
17.15a; 6 ) 

3.2 Two Hilbert spaces of probability distributions 

We have given direct parameterisations of our families of probability distri- 
butions and thus we have implicitly represented them as finite dimensional 
manifolds. In this section we will see how families of probability distribu- 
tions can be thought of as being embedded in a Hilbert space and hence they 
inherit a manifold structure and metric from this Hilbert space. 

There are two obvious ways of thinking of embedding a probability density 
function on IR n in a Hilbert space. The first is to simply assume that the 
probability density function is square integrable and hence lies directly in 
L 2 (R n ). The second is to use the fact that a probability density function 
lies in L 1 (]R™) and is non-negative almost everywhere. Hence y/p will lie in 



For clarity we will write L^(M. n ) when we think of L 2 (IR n ) as containing 
densities directly. The D stands for direct. We write D C L|,(IR n ) where T> 
is the set of square integrable probability densities (functions with integral 1 
which are positive almost everywhere). 

Similarly we will write L 2 ? (R n ) when we think of L 2 (M. n ) as being a space 
of square roots of densities. The H stands for Hellinger (for reasons we will 
explain shortly). We will write TL for the subset of L 2 H consisting of square 
roots of probability densities. 

We now have two possible ways of formalizing the notion of a family of 
probability distributions. In the next section we will define a smooth family 



L 2 (R n ). 
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of distributions to be either a smooth submanifold of L 2 D which also lies in T> 
or a smooth submanifold of L 2 H which also lies in H. Either way the families 
we discussed earlier will give us finite dimensional families in this more formal 
sense. 

The Hilbert space structures of L 2 D and L 2 H allow us to define two notions 
of distance between probability distributions which we will denote do and 
de\ Given two probability distributions p\ and P2 we have an injection i into 
L 2 so one defines the distance to be the norm of t(pi) — l{j>2)- So given two 
probability densities p\ and P2 on IR n we can define: 



Here d/x is the Lebesgue measure. d# defines the Hellinger distance be- 
tween the two distributions, which explains are use of if as a subscript. We 
will write (•,•)# for the inner product associated with d# and (•, -)r> or simply 
(•, •) for the inner product associated with d D . 

In this paper we will consider the projection of the conditional density 
of the true state of the system given the observations (which is assumed to 
lie in T> or %) onto a submanifold. The notion of projection only makes 
sense with respect to a particular inner product structure. Thus we can 
consider projection using d H or projection using d D . Each has advantages 
and disadvantages. 

The most notable advantage of the Hellinger metric is that the dn metric 
can be defined independently of the Lebesgue measure and its definition 
can be extended to define the distance between measures without density 
functions (see Jacod and Shiryaev [20] or Hanzon [18]). In particular the 
Hellinger distance is indepdendent of the choice of parameterization for M n . 
This is a very attractive feature in terms of the differential geometry of our 
set up. 

Despite the significant theoretical advantages of the du metric, the do 
metric has an obvious advantage when studying mixture families: it comes 
from an inner product on L 2 D and so commutes with addition on L 2 D . So 
it should be relatively easy to calculate with the dp metric when adding 
distributions as happens in mixture families. As we shall see in practice, when 
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one performs concrete calculations, the dn metric works well for exponential 
families and the do metric works well for mixture families. While the du 
metric leads to the Fisher Information and to an equivalence with Assumed 
Density Filters when used on exponential families, see [10], the do metric for 
simple mixture families is equivalent to a Galerkin method, see for example 



3.3 The tangent space of a family of distributions 

To make our notion of smooth families precise we need to explain what we 
mean by a smooth map into an infinite dimensional space. 

Let U and V be Hilbert spaces and let / : U — >■ V be a continuous map 
(/ need only be defined on some open subset of U). We say that / is Frechet 
differentiable at x if there exists a bounded linear map A : U — > V satisfying: 

r \\f(h) - f(x) - Ah\\ v 

lim iiTTi 

h^x \\h\\u 

If A exists it is unique and we denote it by Df(x). This limit is called the 
Frechet derivative of / at x. It is the best linear approximation to / at in 
the sense of minimizing the norm on V. 

This allows us to define a smooth map / : U — )■ V defined on an open 
subset of U to be an infinitely Frechet differentiable map. We define an 
immersion of an open subset of lR n into V to be a map such that Df(x) is 
injective at every point where / is defined. The latter condition ensures that 
the best linear approximation to / is a genuinely n dimensional map. 

Given an immersion / defined on a neighbourhood of x, we can think of 
the vector subspace of V given by the image of D/(x) as representing the 
tangent space at x. 

To make these ideas more concrete, let us suppose that p(9) is a probabil- 
ity distribution depending smoothly on some parameter 9 = (9i, 9i, ■ ■ ■ , rn ) G 
U where U is some open subset of lR m . The map 9 — > p{9) defines a map 
i : U —7- V. At a given point 9 G U and for a vector h = (hi, h 2 , ... , h m ) G M m 
we can compute the Frechet derivative to obtain: 

m ft 
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L 2 



So we can identify the tangent space at 9 with the following subspace of 



D- 

dp dp dp 

We can formally define a smooth n-dimensional family of probability dis- 
tributions in L 2 D to be an immersion of an open subset of M. n into D. Equiv- 
alently it is a smoothly parameterized probability distribution p such that 
the above vectors in L 2 are linearly independent. 

We can define a smooth m-dimensional family of probability distributions 
in L 2 H in the same way. This time let q(9) be a square root of a probability 
distribution depending smoothly on 9. The tangent vectors in this case will 
be the partial derivatives of q with respect to 9. Since one normally prefers 
to work in terms of probability distributions rather than their square roots 
we use the chain rule to write the tangent space as: 

\ dp 1 dp 1 dp 

We have defined a family of distributions in terms of a single immersion / 
into a Hilbert space V. In other words we have defined a family of distribu- 
tions in terms of a specific parameterization of the image of /. It is tempting 
to try and phrase the theory in terms of the image of /. To this end, one 
defines an embedded submanifold of V to be a subspace of V which is covered 
by immersions ft from open subsets of W 1 where each /j is a homeomorphisms 
onto its image. With this definition, we can state that the tangent space of 
an embedded submanifold is independent of the choice of parameterization. 

One might be tempted to talk about submanifolds of the space of prob- 
ability distributions, but one should be careful. The spaces % and V are 
not open subsets of L 2 H and L 2 D and so do not have any obvious Hilbert- 
manifold structure. To see why, consider Figure [3] where we have peturbed 
a probability distribution slightly by subtracting a small delta-like function. 



3.4 The Fisher information metric 

Given two tangent vectors at a point to a family of probability distributions 
we can form their inner product using (-,-)#. This defines a so-called Rie- 
mannian metric on the family. With respect to a particular parameterization 
9 we can compute the inner product of the i th and j th basis vectors given in 
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Figure 3: An element of L 2 arbitrarily close to the normal distribution but 
not in H 



equation [6j We call this quantity \g%j. 



-Mo) 



1 dp 1 dp 



"IjpdQi 2jpdQ rj 



)n 



1 dp dp 



pdOi 89 j 
dlogp dlogp 



86i ddj 
dlogp d\ogp 
~d6i d~6~ 



pdfi 



Up to the factor of t, this last formula is the standard definition for the Fisher 
information matrix. So our is the Fisher information matrix. We can now 
interpret this matrix as the Fisher information metric and observe that, up to 
the constant factor, this is the same thing as the Hellinger distance. See [2], 
[30] and [1] for more in depth study on this differential geometric approach 
to statistics. 



Example 3.3 The Gaussian family of densities can be parameterized using 
parameters mean \i and variance v. With this parameterization the Fisher 
metric is given by: 



g{v,v) 



l 
o 





l/(2v) 



The representation of the metric as a matrix depends heavily upon the 
choice of parameterization for the family. 
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Example 3.4 The Gaussian family may be considered as a particular expo- 
nential family with parameters 9\ and 9 2 given by: 

p(x, 9) = exp(#ia; + 6 2 x 2 - 

where ip(6) is chosen to normalize p. It follows that: 

1 



V>(0) = -log 



7T 



CI 

w 2 



This is related to the familiar parameterization in terms of fi and v by: 

u. = -0i/(20 2 ), v = a 2 = (l/9 2 - 6l/6 2 2 )/2 

One can compute the Fisher information metric relative to the parameteri- 
zation Q\ to obtain: 



9(0) 



-1/(202 



e 1 /(2e 2 ) 



9 1 /(26 2 ) l/(26 2 ) - 6 2 /(26i 



The particular importance of the metric structure for this paper is that 
it allows us to define orthogonal projection of L 2 H onto the tangent space. 

Suppose that one has m linearly independent vectors Wi spanning some 
subspace W of a Hilbert space V. By linearity, one can write the orthogonal 
projection onto W as: 



1=1 1=1 

for some appropriately chosen constants A V K Since II acts as the identity on 
uii we see that A 1 ^ must be the inverse of the matrix A^ = (wi,Wj). 

We can apply this to the basis given in equation [6j Defining to be the 
inverse of the matrix g^ we obtain the following formula for projection, using 
the Hellinger metric, onto the tangent space of a family of distributions: 



n ff M = £ 



i=i ij=i 



1 dp 
2~/^d9~ 



1 dp 
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3.5 The direct L 2 metric 

The ideas from the previous section can also be applied to the direct L 2 
metric. This gives a different Riemannian metric on the manifold. 

We will write h = to denote the L 2 metric when written with respect 
to a particular parameterization. 

Example 3.5 In coordinates fi, v, the L 2 metric on the Gaussian family is: 



h(fJL,v) 



1 



1 

f 



We can obtain a formula for projection in L 2 D using the direct L 2 metric 
using the basis given in equation [5| We write h % i for the matrix inverse of 
hi 



n D (v) 



£ 

i=l 



5>« 



dp \ 



D 



dp 



(7) 



4 The projection filter 

Given a family of probability distributions parameterised by 9, we wish to 
approximate an infinte dimensional solution to the non-linear filtering SPDE 
using elements of this family. Thus we take the Kushner-Stratonovich equa- 
tion [4], view it as defining a stochastic vector field in T> and then project that 
vector field onto the tangent space of our family. The projected equations 
can then be viewed as giving a stochastic differential equation for 9. In this 
section we will write down these projected equations explicitly. 

Let 9 — y p{9) be the parameterization for our family. A curve t — y 9(t) 
in the parameter space corresponds to a curve t —y p(-, 9(t)) in T>. For such 
a curve, the left hand side of the Kushner-Stratonovich equation [4] can be 
written: 

dtP(-At)) = E^^W) 

1=1 % 
III 

i=l 
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where we write = {vi} is the basis for the tangent space of the manifold 
at 0(t). 

Given the projection formula given in equation [7[ we can project the 
terms on the right hand side onto the tangent space of the manifold using 
the direct L 2 metric as follows: 



n e D [c*p) 



E 

i=i 

rn 

E 

i=l 
m 

E 

i=i 



m 

j=i 

m 



i=l 



Thus if we take L 2 projection of each side of equation Q we obtain: 



E"- d " - = E 



i=l 



i=l 



h ij \ (p, Cvj)dt - ( 7 °(p), ^>dt + ^(7 fe (p), ^) ° dE A 

.i=i 



fc=i 



Since the Vi form a basis of the tangent space, we can equate the coefficients 
of Vi to obtain: 



dff 1 = J>« { W),Cvj)dt- { 1 \ P {9)),v ] )At + Y,{l k {p{0)),v J ) o&Y k . 

3=1 { k=l J 

(8) 

This is the promised finite dimensional stochastic differential equation for 9 
corresponding to L 2 projection. 

If preferred, one could instead project the Kushner-Stratonovich equa- 
tion using the Hellinger metric instead. This yields the following stochastic 
differential equation derived originally in |12j : 



m ( r* i d \ 

= E^' ( — ,^)dt- (-|&| 2 ,^)dt + ^(6 fe ,^) odE fc 

3=1 \ P k=l J 



(9) 



Note that the inner products in this equation are the direct L 2 inner products: 
we are simply using the L 2 inner product notation as a compact notation for 
integrals. 
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5 Numerical implementation 

Equations [8] and [9] both give finite dimensional stochastic differential equa- 
tions that we hope will approximate well the solution to the full Kushner- 
Stratonovich equation. We wish to solve these finite dimensional equations 
numerically and thereby obtain a numerical approximation to the non-linear 
filtering problem. 

Because we are solving a low dimensional system of equations we hope 
to end up with a more efficient scheme than a brute-force finite difference 
approach. A finite difference approach can also be seen as a reduction of 
the problem to a finite dimensional system. However, in a finite difference 
approach the finite dimensional system still has a very large dimension, de- 
termined by the number of grid points into which one divides M. n . By contrast 
the finite dimensional manifolds we shall consider will be defined by only a 
handful of parameters. 

6 Software design 

The specific solution algorithm will depend upon numerous choices: whether 
to use L 2 or Hellinger projection; which family of probability distributions to 
choose; how to parameterize that family; the representation of the functions 
/, o and b; how to perform the integrations which arise from the calculation 
of expectations and inner products; the numerical method selected to solve 
the finite dimensional equations. 

To test the effectiveness of the projection idea, we have implemented a 
C++ engine which performs the numerical solution of the finite dimensional 
equations and allows one to make various selections from the options above. 
Currently our implementation is restricted to the case of the direct L 2 pro- 
jection for a 1-dimensional state X and 1-dimensional noise W. However, 
the engine does allow one to experiment with various manifolds, parameter- 
iziations and functions /, a and b. 

We use object oriented programming techniques in order to allow this 
flexibility. Our implementation contains two key classes FunctionRing and 
Manifold. 

To perform the computation, one must choose a data structure to rep- 
resent elements of the function space. However, the most effective choice 
of representation depends upon the family of probability distributions one 
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is considering and the functions /, a and b. Thus the C++ engine does not 
manipulate the data structure directly but instead works with the functions 
via the FunctionRing interface. A UML (Unified Modelling Language [33J) 
outline of the FunctionRing interface is given in table [T] 

FunctionRing 

+ add( fi : Function, f 2 : Function ) : Function 

+ multiply( fx : Function, f 2 : Function ) : Function 

+ multiply( s : Real, / : Function ) : Function 

+ differentiate( / : Function ) : Function 

+ integrate( / : Function ) : Real 

+ evaluate( / : Function ) : Real 

+ constantFunction( s : Real ) : Function 

Table 1: UML for the FunctionRing interface 



Manifold 

+ getRingQ : FunctionRing 
+ getDensity( 9 ) : Function 

+ computeTangentVectors( 9 : Point ) : Function* 
+ updatePoint( 9 : Point, A9 : Real* ) : Point 
+ finalizePoint( 9 : Point ) : Point 

Table 2: UML for the Manifold interface 

The other key abstraction is the Manifold. We give a UML representation 
of this abstraction in table |2j For readers unfamiliar with UML, we remark 
that the * symbol can be read "list" . For example, the computeTangent Vec- 
tors function returns a list of functions. 

The Manifold uses some convenient internal representation for a point, 
the most obvious representation being simply the m-tuple (9x,9 2 , . . . 9 m ). On 
request the Manifold is able to provide the density associated with any point 
represented as an element of the FunctionRing. 

In addition the Manifold can compute the tangent vectors at any point. 
The computeTangentVectors method returns a list of elements of the Func- 
tionRing corresponding to each of the vectors Vi = Jj^ in turn. If the point 
is represented as a tuple 9 = (9i, 9 2 , . . . 9 n ), the method updatePoint simply 
adds the components of the tuple A9 to each of the components of 9. If 
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a different internal representation is used for the point, the method should 
make the equivalent change to this internal representation. 

The finalizePoint method is called by our algorithm at the end of every 
time step. At this point the Manifold implementation can choose to change its 
parameterization for the state. Thus the finalizePoint allows us (in principle 
at least) to use a more sophisticated atlas for the manifold than just a single 
chart. 

One should not draw too close a parallel between these computing ab- 
stractions and similarly named mathematical abstractions. For example, the 
space of objects that can be represented by a given Function Ring do not need 
to form a differential ring despite the differentiate method. This is because 
the differentiate function will not be called infinitely often by the algorithm 
below, so the functions in the ring do not need to be infinitely differentiable. 

Similarly the finalizePoint method allows the Manifold implementation 
more flexibility than simply changing chart. From one time step to the 
next it could decide to use a completely different family of distributions. 
The interface even allows the dimension to change from one time step to the 
next. We do not currently take advantage of this possibility, but adapatively 
choosing the family of distributions would be an interesting topic for further 
research. 

6.1 Outline of the algorithm 

The C++ engine is initialized with a Manifold object, a copy of the initial 
Point and Function objects representing /, a and b. 

At each time point the engine asks the manifold to compute the tangent 
vectors given the current point. Using the multiply and integrate functions 
of the class Functioning, the engine can compute the inner products of any 
two functions, hence it can compute the metric matrix hy. Similarly, the 
engine can ask the manifold for the density function given the current point 
and can then compute C*p. Proceeding in this way, all the coefficients of dt 
and odF in equation [8] can be computed at any given point in time. 

Were equation [8] an Ito SDE one could now numerically estimate AO, the 
change in over a given time interval A in terms of A and Ay, the change 
in Y. One would then use the updateState method to compute the new point 
and then one could repeat the calculation for the next time interval. In other 
words, were equation [8] an Ito SDE we could numerically solve the SDE using 
the Euler scheme. 
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However, equation [8] is a Stratonovich SDE so the Euler scheme is no 
longer valid. Various numerical schemes for solving stochastic differential 
equations are considered in [13] and [25] . One of the simplest is the Stratonovich- 
Heun method described in [15] . Suppose that one wishes to solve the SDE: 

dy t = f(y t )dt + g{y t ) o dW t 

The Stratonvich-Heun method generates an estimate for the solution y n at 
the n-th time interval using the formulae: 

Y n+ i = y n + f(y n )A + g(y n )AW n 

Vn+i = y n + -(f{y n ) + f(Y n+1 ))A + -(g(y n ) + g{Y n+1 ))AW n 

In these formulae A is the size of the time interval and AW n is the change 
in W. One can think of Y n+1 as being a prediction and the value y n+ i as 
being a correction. Thus this scheme is a direct translation of the standard 
Euler-Heun scheme for ordinary differential equations. 

We can use the Stratonovich-Heun method to numerically solve equation 
|8j Given the current value 9 n for the state, compute an estimate for A9 n by 
replacing dt with A and dW with AW in equation [8j Using the updateState 
method compute a prediction O n+ i. Now compute a second estimate for A9 n 
using equation [8] in the state O n +i. Pass the average of the two estimates to 
the updateState function to obtain the the new state 9 n+ \. 

At the end of each time step, the method finalizeState is called. This 
provides the manifold implementation the opportunity to perform checks 
such as validation of the state, to correct the normalization and, if desired, 
to change the representation it uses for the state. 

One small observation worth making is that the equation [8] contains the 
term ft, 8 - 7 , the inverse of the matrix hij. However, it is not necessary to actually 
calculate the matrix inverse in full. It is better numerically to multiply 
both sides of equation [8] by the matrix and then compute d9 by solving 
the resulting linear equations directly. This is the approach taken by our 
algorithm. 

As we have already observed, there is a wealth of choices one could make 
for the numerical scheme used to solve equation[8j we have simply selected the 
most convenient. The existing Manifold and FunctionRing implementations 
could be used directly by many of these schemes — in particular those based 
on Runge-Kutta schemes. In principle one might also consider schemes that 
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require explicit formulae for higher derivatives such as dg .g g . ■ In this case one 
would need to extend the manifold abstraction to provide this information. 

Similarly one could use the same concepts in order to solve equation [9] 
where one uses the Hellinger projection. In this case the Function Ring would 
need to be extended to allow division. This would in turn complicate the 
implementation of the integrate function, which is why we have not yet im- 
plemented this approach. 



6.2 Implementation for normal mixture families 

Let TZ denote the space of functions which can be written as finite linear 
combinations of terms of the form: 

where n is non-negative integer and a, b and c are constants. TZ is closed 
under addition, multiplication and differentiation, so it forms a differential 
ring. 

We have written an implementation of Function Ring corresponding to 
TZ. Although the implementation is mostly straightforward some points are 
worth noting. 

Firstly, we store elements of our ring in memory as a collection of tuples 
(±, a, b, c, n). Although one can write: 

_j_^n^ax 2 +bx+c _ ^n^ax 2 +bx 

for appropriate q, the use or such a term in computer memory should be 
avoided as it will rapidly lead to significant rounding errors. A small amount 
of care is required throughout the implementation to avoid such rounding 
errors. 

Secondly let us consider explicitly how to implement integration for this 

2 

ring. Let us define u n to be the integral of x n e~ x . Using integration by parts 
one has: 

n -x> A n-1 f°° n _ 2 _ X 2. n-1 
u n ■= / x e dx = — - — / x e dx = — - — w n _ 2 



Since u = y/n and u\ = we can compute u n recursively. Hence we 
can analytically compute the integral of p(x)e~ x for any polynomial p. By 
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substitution, we can now integrate p(x — /i)e~^ _/i ^ 2 for any \l. By completing 
the square we can analytically compute the integral of p(x)e ax +bx+c so long 
as a < 0. Putting all this together one has an algorithm for analytically 
integrating the elements of 7Z. 

Let M l denote the space of probability distributions that can be written 
as X/l=i c k£ akX +bkX for some real numbers a^, and with < 0. Given 
a smooth curve j(t) in Af l we can write: 

i 

7 (t) = J2c k (t)e ak(t)x2+bk{t)x . 
k=i 

We can then compute: 

d7 A / / do* , d6fc \ afca;2+fefc:E dcfc 2+bfc ,\ 
^ ~ ^\\dt + dt^T* 6 + dt 6 J 

We deduce that the tangent vectors of any smooth submanifold of M l 
must also lie in TZ. In particular this means that our implementation of 
Function Ring will be sufficient to represent the tangent vectors of any mani- 
fold consisting of finite normal mixtures. 

Combining these ideas we obtain the main theoretical result of the paper. 

Theorem 6.1 Let 8 be a parameterization for a family of probability distri- 
butions all of which can be written as a mixture of at most i Gaussians. Let 
f , a = cr 2 and b be functions in the ring 1Z. In this case one can carry out 
the direct L 2 projection algorithm for the problem given by equation using 
analytic formulae for all the required integrations. 

Although the condition that /, a and b lie in 1Z may seem somewhat 
restrictive, when this condition is not met one could use Taylor expansions 
to find approximate solutions. 

Although the choice of parameterization does not affect the choice of 
Function Ring, it does affect the numerical behaviour of the algorithm. In 
particular if one chooses a parameterization with domain a proper subset of 
M. m , the algorithm will break down the moment the point 9 leaves the domain. 
With this in mind, in the numerical examples given later in this paper we 
parameterize normal mixtures of k Gaussians with a parameterization defined 
on the whole of W 1 . We describe this parameterization below. 
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Label the parameters £j (with 1 < i < k — 1), X\, (with 2 < i < k) and 
Si (with 1 < i < k). This gives a total of 3fc — 1 parameters. So we can write 

6 = (fi, • • • ,6-i>^i>2/2, • • • ,Vk,si, . . . ,s k ) 

Given a point 9 define variables as follows: 

Ai = logit' 1 ^) 

Xi = logit- 1 te)(l-A 1 -A 2 -...-A l _ 1 ) (2<z<A:-l) 
Afc = 1 — Ai — A2 — • . . — A,t_i 
Xi = Xi-i + e Vi (2 < % < k) 
°i = e s * 

where the logit function sends a probability p £ [0, 1] to its log odds, ln(p/l — 
p). We can now write the density associated with 9 as: 



r-r crjv27r 



fx - Xi) 2 , 



We do not claim this is the best possible choice of parameterization, but 
it certainly performs better than some more naive parameteriations with 
bounded domains of definition. We will call the direct L 2 projection algo- 
rithm onto the normal mixture family given with this projection the L2NM 
projection filter. 



6.3 Comparison with the Hellinger exponential (HE) 
projection algorithm 

A similar algorithm is described in 0, [TD] for projection using the Hellinger 
metric onto an exponential family. We refer to this as the HE projection 
filter. 

It is worth highlighting the key differences between our algorithm and the 
exponential projection algorithm described in [9]. 

• In [9] only the special case of the cubic sensor was considered. It was 
clear that one could in principle adapt the algorithm to cope with other 
problems, but there remained symbolic manipulation that would have 
to be performed by hand. Our algorithm automates this process by 
using the FunctionRing abstraction. 
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• When one projects onto an exponential family, the stochastic term 
in equation ^ simplifies to a term with constant coefficients. This 
means it can be viewed equally well as either an Ito or Stratonovich 
SDE. The practical consequence of this is that the HE algorithm can 
use the Euler-Maruyama scheme rather than the Stratonvoich-Heun 
scheme to solve the resulting stochastic ODE's. Moreover in this case 
the Euler-Maruyama scheme coincides with the generally more precise 
Milstein scheme. 

• In the case of the cubic sensor, the HE algorithm requires one to nu- 
merically evaluate integrals such as: 



where the 8i are real numbers. Performing such integrals numerically 
considerably slows the algorithm. In effect one ends up using a rather 
fine discretization scheme to evaluate the integral and this somewhat 
offsets the hoped for advantage over a finite difference method. 

7 Numerical Results 

In this section we compare the results of using the direct L 2 projection filter 
onto a mixture of normal distributions with other numerical methods. In 
particular we compare it with: 

1. A finite difference method using a fine grid which we term the ex- 
act filter. Various convergence results are known ([26] and [27]) for 
this method. In the simulations shown below we use a grid with 1000 
points on the x-axis and 5000 time points. In our simulations we could 
not visually distinguish the resulting graphs when the grid was refined 
further justifying us in considering this to be extremely close to the 
exact result. The precise algorithm used is as described in the section 
on "Partial Differential Equations Methods" in chapter 8 of Bain and 
Crisan [SJ. 

2. The extended Kalman filter (EK). This is a somewhat heuristic ap- 
proach to solving the non-linear filtering problem but which works well 




J. Armstrong, D. Brigo. Stochastic Filtering via L 2 Projection 



25 



so long as one assumes the system is almost linear. It is implemented 
essentially by linearising all the functions in the problem and then us- 
ing the exact Kalman filter to solve this linear problem - the details 
are given in j5|. The EK filter is widely used in applications and so 
provides a standard benchmark. However, it is well known that it can 
give wildly innaccurate results for non-linear problems so it should be 
unsurprising to see that it performs badly for most of the examples we 
consider. 

3. The HE projection filter. In fact we have implemented a generalization 
of the algorithm given in [12] that can cope with filtering problems 
where b is an aribtrary polynomial, o is constant and / = 0. Thus we 
have been able to examine the performance of the exponential projec- 
tion filter over a slightly wider range of problems than have previously 
been considered. 

To compare these methods, we have simulated solutions of the equations 
[I] for various choices of /, a and b. We have also selected a prior probability 
distribution po for X and then compared the numerical estimates for the 
probability distribution p at subsequent times given by the different algo- 
rithms. In the examples below we have selected a fixed value for the intial 
state X rather than drawing at random from the prior distribution. This 
should have no more impact upon the results than does the choice of seed 
for the random number generator. 

Since each of the approximate methods can only represent certain distri- 
butions accurately, we have had to use different prior distributions for each 
algorithm. To compare the two projection filters we have started with a 
polynomial exponential distribution for the prior and then found a nearby 
mixture of normal distributions. This nearby distribution was found using 
a gradient search algorithm to minimize the numerically estimated L 2 norm 
of the difference of the normal and polynomial exponential distributions. As 
indicated earlier, polynomial exponential distributions and normal mixtures 
are qualitatively similar so the prior distributions we use are close for each 
algorithm. 

For the extended Kalman filter, one has to approximate the prior dis- 
tribution with a single Gaussian. We have done this by moment matching. 
Inevitably this does not always produce satisfactory results. 

For the exact filter, we have used the same prior as for the L 2 projection 
filter. 
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7.1 The linear filter 



The first test case we have examined is the linear filtering problem. In this 
case the probability density will be a Gaussian at all times — hence if we 
project onto the two dimensional family consisting of all Gaussian distribu- 
tions there should be no loss of information. Thus both projection filters 
should give exact answers for linear problems. This is indeed the case, and 
gives some confidence in the correctness of the computer implementations of 
the various algorithms. 

7.2 The quadratic sensor 

The second test case we have examined is the quadratic sensor. This is prob- 
lem [l| with / = 0, a — Ci and b(x) = C2X 2 for some positive constants c\ and 
c 2 . In this problem the non-injectivity of b tends to cause the distribution at 
any time to be bimodal. To see why, observe that the sensor provides no in- 
formation about the sign of x, once the state of the system has passed through 
we expect the probability density to become approximately symmetrical 
about the origin. Since we expect the probability density to be bimodal for 
the quadratic sensor it makes sense to approximate the distribution with a 
linear combination of two Gaussian distributions. 

In Figure [4] we show the probability density as computed by three of the 
algorithms at 10 different time points for a typical quadratic sensor prob- 
lem. To reduce clutter we have not plotted the results for the exponen- 
tial filter. The prior exponential distribution used for this simulation was 
p(x) = exp(0.25 — x 2 + x 3 — 0.25a; 4 ). The initial state was X = and Y = 0. 

As one can see the probability densities computed using the exact filter 
and the L2NM filter become visually indistinguishable when the state moves 
away from the origin. The extended Kalman filter is, as one would expect, 
completely unable to cope with these bimodal distributions. In this case the 
extended Kalman filter is simply representing the larger of the two modes. 

In Figure [5] we have plotted the L 2 residuals for the different algorithms 
when applied to the quadratic sensor problem. We define the L 2 residual to 
be the L 2 norm of the difference between the exact filter distribution and the 
estimated distribution. 



L 2 residual 





Figure 4: Estimated probability densities at 10 time points for the probl 

b(x) = x 2 



J. Armstrong, D. Brigo. Stochastic Filtering via L 2 Projection 28 

As can be seen, the L2NM projection filter outperforms the HE projection 
filter when applied to the quadratic sensor problem. Notice that the L 2 resid- 
uals are initially small for both the HE and the L2NM filter. The superior 
performance of the L2NM projection filter in this case stems from the fact 
that one can more accurately represent the distributions that occur using the 
normal mixture family than using the polynomial exponential family. 

If preferred one could define a similar notion of residual using the Hellinger 
metric. The results would be qualitatively similar. 

One interesting feature of Figure [5] is that the error remains bounded in 
size when one might expect the error to accumulate over time. This suggests 
that the arrival of new measurements is gradually correcting for the errors 
introduced by the approximation. 
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Figure 5: L 2 residuals for the problem b(x) = x 2 
7.3 The cubic sensor 

A third test case we have considered is the general cubic sensor. In this 
problem one has / = 0, o = c\ for some constant c\ and b is some cubic 
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function. 

The case when 6 is a multiple of a; 3 is called the cubic sensor and was 
used as the test case for the exponential projection filter using the Hellinger 
metric considered in [12] . It is of interest because it is the simplest case where 
b is injective but where it is known that the problem cannot be reduced to 
a finite dimensional stochastic differential equation [19J. It is known from 
earlier work that the exponential filter gives excellent numerical results for 
the cubic sensor. 

Our new implementations allow us to examine the general cubic sensor. 
In Figure |6j we have plotted example probability densities over time for the 
problem with / = 0, a — 1 and b = x 3 — x. With two turning points for b 
this problem is very far from linear. As can be seen in Figure [6] the L2NM 
projection remains close to the exact distribution throughout. A mixture of 
only two Gaussians is enough to approximate quite a variety of differently 
shaped distributions with perhaps surprising accuracy. As expected, the 
extended Kalman filter gives poor results until the state moves to a region 
where b is injective. The results of the exponential filter have not been plotted 
in Figure [6] to reduce clutter. It gave similar results to the L2NM filter. 

The prior polynomial exponential distribution used for this simulation 
was p(x) = exp(0.5x 2 — 0.25x 4 ). The initial state was X = 0, which is one 
of the modes of prior distribution. The inital value for Y was taken to be 0. 

One new phenomenon that occurs when considering the cubic sensor is 
that the algorithm sometimes abruptly fails. This is true for both the L2NM 
projection filter and the HE projection filter. 

To show the behaviour over time more clearly, in Figure [7] we have shown a 
plot of the mean and standard deviation as estimated by the L2NM projection 
filter against the actual mean and standard deviation. We have also indicated 
the true state of the system. The mean for the L2MN filter drops to at 
approximately time 7. It is at this point that the algorithm has failed. 

What has happened is that as the state has moved to a region where 
the sensor is reasonably close to being linear, the probability distribution 
has tended to a single normal distribution. Such a distribution lies on the 
boundary of the family consisting of a mixture of two normal distributions. 
As we approach the boundary, hij ceases to be invertible causing the failure 
of the algorithm. Analogous phenomena occur for the exponential filter. 

The result of running numerous simulations suggests that the HE filter 
is rather less robust than the L2NM projection filter. The typical behaviour 
is that the exponential filter maintains a very low residual right up until the 




Figure 6: Estimated probability densities at 10 time points for the probl 

b(x) = x 3 — x 
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Figure 7: Estimates for the mean and standard deviation for the problem 
b(x) = x 3 — x 



J. Armstrong, D. Brigo. Stochastic Filtering via L 2 Projection 



32 



point of failure. The L2NM projection filter on the other hand tends to give 
slightly inaccurate results shortly before failure and can often correct itself 
without failing. 

This behaviour can be seen in Figure |8j In this figure, the residual for 
the exponential projection remains extremely low until the algorithm fails 
abruptly - this is indicated by the vertical dashed line. The L2NM filter on 
the other hand deteriorates from time 6 but only fails at time 7. 
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Figure 8: L 2 residuals for the problem b(x) = x 3 — x 

The L 2 residuals of the L2MN method are rather large between times 6 
and 7 but note that the accuracy of the estimates for the mean and standard 
deviation in Figure [7] remain reasonable throughout this time. To understand 
this note that for two normal distributions with means a distance x apart, the 
L 2 distance between the distributions increases as the standard deviations of 
the distributions drop. Thus the increase in L 2 residuals between times 6 and 
7 is to a large extent due to the drop in standard deviation between these 
times. As a result, one may feel that the L 2 residual doesn't capture precisely 
what it means for an approximation to be "good". In the next section we 
will show how to measure residuals in a way that corresponds more closely 
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to the intuitive idea of them having visually similar distribution functions. 
In practice one's definition of a good approximation will depend upon the 
application. 

Although one might argue that the filter is in fact behaving reasonably 
well between times 6 and 7 it does ultimately fail. There is an obvious fix for 
failures like this. When the current point is sufficiently close to the boundary 
of the manifold, simply approximate the distribution with an element of the 
boundary. In other words, approximate the distribution using a mixture of 
fewer Gaussians. Since this means moving to a lower dimensional family 
of distributions, the numerical implementation will be more efficient on the 
boundary. This will provide a temporary fix the failure of the algorithm, but 
it raises another problem: as the state moves back into a region where the 
problem is highly non linear, how can one decide how to leave the boundary 
and start adding additional Gaussians back into the mixture? We hope to 
address this question in a future paper. 

8 Comparison with Particle Methods 

Particle methods approximate the probability density p using discrete mea- 
sures of the form: 



These measures are generated using a Monte Carlo method. The measure 
can be thought of as the empirical distributions associated with randomly 
located particles at position Vi(t) and of stochastic mass a^t). 

Particle methods are currently some of the most effective numerical meth- 
ods for solving the filtering problem. See [5] and the references therein for 
details of specific particle methods and convergence results. 

The first issue in comparing projection methods with particle methods is 
that, as a linear combination of Dirac masses, one can only expect a particle 
method to converge weakly to the exact solution. In particular the L 2 metric 
and the Hellinger metric are both inappropriate measures of the residual 
between the exact solution and a particle approximation. Indeed the L 2 
distance is not defined and the Hellinger distance will always take the value 



To combat this issue, we will measure residuals using the Levy metric. If 
p and q are two probability measures on R and P and Q are the associated 




%/2. 
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cumulative distribution functions then the Levy metric is defined by: 

d L (p, q) = inf {e : P(x - e) - e < Q(x) < P(x + e) + e Vx} 

This can be interpreted geometrically as the size of the largest square with 
sides parallel to the coordinate axes that can be inserted between the com- 
pleted graphs of the cumulative distribution functions (the completed graph 
of the distribution function is simply the graph of the distribution function 
with vertical line segments added at discontinuities). 

The Levy metric can be seen as a special case of the Levy-Prokhorov 
metric. This can be used to measure the distance between measures on a 
general metric space. For Polish spaces, the Levy-Prokhorov metric metrises 
the weak convergence of probability measures pE]. Thus the Levy metric 
provides a reasonable measure of the residual of a particle approximation. 
We will call residuals measured in this way Levy residuals. 

A second issue in comparing projection methods with particle methods is 
deciding how many particles to use for the comparison. A natural choice is to 
compare a projection method onto an m-dimensional manifold with a particle 
method that approximates the distribution using \{m + l)/2] particles. In 
other words, equate the dimension of the families of distributions used for 
the approximation. 

A third issue is deciding which particle method to choose for the com- 
parison from the many algorithms that can be found in the literature. We 
can work around this issue by calculating the best possible approximation 
to the exact distribution that can be made using \(m + l)/2] Dirac masses. 
This approach will substantially underestimate the Levy residual of a parti- 
cle method: being Monte Carlo methods, large numbers of particles would 
be required in practice. 

In Figure [9] we have plotted bounds on the Levy residuals for the two 
projection methods for the quadratic sensor. Since mixtures of two normal 
distributions lie in a 5 dimensional family, we have compared these residuals 
with the best possible Levy residual for a mixture of three Dirac masses. 

To compute the Levy residual between two functions we have approxi- 
mated first approximated the cumulative distribution functions using step 
functions. We have used the same grid for these steps as we used to compute 
our "exact" filter. We have then used a brute force approach to compute a 
bound on size of the largest square that can be placed between these step 
functions. Thus if we have used a grid with n points to discretize the x- 
axis, we will need to make n 2 comparisons to estimate the Levy residual. 
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Figure 9: Levy residuals for the problem b(x) = x 2 
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More efficient algorithms are possible, but this approach is sufficient for our 
purposes. 

The maximum accuracy of the computation of the Levy metric is con- 
strained by the grid size used for our "exact" filter. Since the grid size in 
the x direction for our "exact" filter is 0.01, our estimates for the projection 
residuals are bounded below by 0.02. 

The computation of the minimum residual for a particle filter is a little 
more complex. Let minEpsilon(F, n) denote the minimum Levy distance 
between a distribution with cumulative distribution F and a distribution of 
n particles. Let minN(F, e) denote the minimum number of particles required 
to approximate F with a residual of less than e. If we can compute minN we 
can use a line search to compute minEspilon. 

To compute minN(F, e) for an increasing step function F with F(— oo) = 
and .F(oo) = 1, one needs to find the minimum number of steps in a 
similar increasing step function G that is never further than e away from F 
in the L°° metric. One constructs candidate step functions G by starting 
with G{— oo) = and then moving along the x-axis adding in additional 
steps as required to remain within a distance e. An optimal G is found by 
adding in steps as late as possible and, when adding a new step, making it 
as high as possible. 

In this way we can compute minN and minEpsilon for step functions F. 
We can then compute bounds on these values for a given distribution by 
approximating its cumulative density function with a step function. 

As can be seen, the exponential and mixture projection filters have similar 
accuracy as measured by the Levy residual and it is impossible to match this 
accuracy using a model containing only 3 particles. 

9 Conclusions and Future Research 

Projection onto a family of normal mixtures using the L 2 metric allows one to 
approximate the solutions of the non-linear filtering problem with surprising 
accuracy using only a small number of component distributions. In this re- 
gard it behaves in a very similar fashion to the projection onto an exponential 
family using the Hellinger metric that has been considered previously. 

The L2NM projection filter has one important advantage over the HE 
projection filter, for problems with polynomial coefficients all required inte- 
grals can be calculated analytically. Problems with more general coefficients 
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can be addressed using Taylor series. One expects this to translate into a 
better performing algorithm — particularly if the approach is extended to 
higher dimensional problems. 

We tested both filters against the optimal filter in simple but interest- 
ing systems, and we provided a metric to compare the performance of each 
filter with the optimal one. We also tested both filters against a particle 
method, showing that with the same number of parameters the L2NM filter 
outperforms the best possible particle method in Levy metric. 

We designed a software structure and populated it with models that make 
the L2NM filter quite appealing from a numerical and computational point 
of view. 

Areas of future research that we hope to address include: the relationship 
between the projection approach and existing numerical approaches to the 
filtering problem; the convergence of the algorithm; improving the stability 
and performance of the algorithm by adaptively changing the parameteriza- 
tion of the manifold; numerical simulations in higher dimensions. 
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